Vcf int64 part2 #1004

jkbonfield · 2019-12-11T12:21:57Z

NOTE: this is branched off my vcf_int64 PR #1000 so it contains that commit also.

This adds extra checks to vcf parsing to set a flag to indicate the record contains 64-bit data which can only be written back to VCF, and not to BCF.

It's probably not complete (I don't check FORMAT data, only INFO and POS), but I'm not sufficiently familier with the vcf parser and format to know where 64-bit data can occur. I think it's just POS and INFO (for END)?

Anyway, this is a proof of principle, hence in its own separate PR so we can merge the other one if needed without too many shenanigans.

Any 64-bit INFO field that wasn't the last in the list would cause subsequent fields to be decoded incorrectly. This commit fixes that, plus updates the tests accordingly so the bug could be triggered. Fixes the first part of samtools#999 (test1.vcf), but doesn't fix the second part (BCF output silently being broken). Fixes samtools/bcftools#1123

daviesrob · 2019-12-11T14:43:59Z

You should also check v->pos and v->rlen as they could both contain 64-bit values that cannot be stored in BCF.

To reduce concerns about change in behaviour due to 64 bit INFO tags, it should be fairly easy to restrict them to just the END tag. There's already a strcmp(key, "END") in vcf_parse so it can detect the need to adjust v->rlen. If that was added to the if (n_val == 1) condition, END tags could take 64 bit values while all others would work as before.

pd3 · 2019-12-11T15:13:01Z

This goes in the right direction, but should build on top of #1003. I agree with forbidding writing out BCFs with 64-bit values. But the problem remains: one out-of-range value will turn on 64-bit mode and will make it impossible to write out an otherwise valid BCF. It should not be compiled in by default, unless you know you are going to work with large genomes. There are terabytes of invalid VCF/BCFs sitting on sanger's drives, but I have not seen a single 64-bit genome yet. It should be the rare case that gets inconvenienced, not the major use case.

vcf.c

jkbonfield · 2019-12-11T16:14:37Z

If the problem really is that endemic, that we have thousands of invalid files and that we must be able to transform these invalid files in / out of BCF silently (with corresponding broken data due to lack of support), then for this immediate problem we should simply disable the 64-bit VCF code so we can take time to do things "right" for the next release (where "right" probably means issue a warning and transform the values to 'missing').

However that's also not your PR which amended the BCF format.

Do you have a simple PR that does this? We don't have time to do something complex as it will require a huge amount of testing which simply won't get done, at least not by me, due to insufficient time this year.

pd3 · 2019-12-11T16:18:12Z

Hm, I did consider the PR simple. Anything outside 64bit branch is just turning large values into missing ones and printing a warning. I have tests in bcftools ready, which are waiting for the change in htslib.

Also adds check for rlen (END) field when trying to output to BCF. Note: this will emit a warning per out-of-range value, but it is expected that the frequency of these is very low. If not, consider this a good icentive to fix your pipelines!

jkbonfield · 2019-12-12T10:17:41Z

@pd3 how about this latest commit?

It is still simple. It's only 20-odd lines of new code which is about 1/3rd the size of your PR and contains fewer problematic things (like mixing long with int64_t, strtol vs strtoll, and creation of non-compliant BCF files) while offering better diagnostics and reporting to the user.

It takes @daviesrob's idea of only permitting the internal representation on the pos field and the END info tag. All other out of range info tags now get converted to "missing" and a warning is issued. It also tracks when the VCF is not representable in BCF and refuses to write one out. That can only possibly happen when BCF itself cannot be used, so it is not a regression in any shape or form.

valeriuo · 2019-12-16T15:58:25Z

Superseded by #1003

jkbonfield added 2 commits December 9, 2019 14:17

Additional checks to prevent writing invalid BCF containing 64-bit data

66c66f2

jkbonfield mentioned this pull request Dec 11, 2019

Jkbonfield vcf int64 #1003

Closed

pd3 reviewed Dec 11, 2019

View reviewed changes

vcf.c Show resolved Hide resolved

jkbonfield mentioned this pull request Dec 12, 2019

bcftools 1.10 Unexpected type 0 samtools/bcftools#1123

Closed

valeriuo closed this Dec 16, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vcf int64 part2 #1004

Vcf int64 part2 #1004

jkbonfield commented Dec 11, 2019 •

edited

Loading

daviesrob commented Dec 11, 2019

pd3 commented Dec 11, 2019

jkbonfield commented Dec 11, 2019

pd3 commented Dec 11, 2019

jkbonfield commented Dec 12, 2019

valeriuo commented Dec 16, 2019

Vcf int64 part2 #1004

Vcf int64 part2 #1004

Conversation

jkbonfield commented Dec 11, 2019 • edited Loading

daviesrob commented Dec 11, 2019

pd3 commented Dec 11, 2019

jkbonfield commented Dec 11, 2019

pd3 commented Dec 11, 2019

jkbonfield commented Dec 12, 2019

valeriuo commented Dec 16, 2019

jkbonfield commented Dec 11, 2019 •

edited

Loading